Abstract. We propose a protocol for secure mining of association rules
in horizontally distributed databases. The current leading protocol is
that of Kantarcioglu and Clifton [12]. Our protocol, like theirs, is based
on the Fast Distributed Mining (FDM) algorithm of Cheung et al. [5],
which is an unsecured distributed version of the Apriori algorithm. The
main ingredients in our protocol are two novel secure multi-party
algorithms: one that computes the union of private subsets that each of
the interacting players holds, and another that tests the inclusion of an
element held by one player in a subset held by another. Our protocol
offers enhanced privacy with respect to the protocol in [12]. In addition,
it is simpler and is significantly more efficient in terms of communication
rounds, communication cost and computational cost.

1 Introduction

We study here the problem of secure mining of association rules in horizontally 
partitioned databases. In that setting, there are several sites (or players) that 
hold homogeneous databases, i.e., databases that share the same schema but hold 
information on different entities. The goal is to find all association rules with 
given minimal support and confidence levels that hold in the unified database, 
while minimizing the information disclosed about the private databases held by 
those players. 

That goal defines a problem of secure multi-party computation. In such
problems, there are M players that hold private inputs, $x_1, \ldots, x_M$,
and they wish to securely compute $y = f(x_1, \ldots, x_M)$ for some public
function $f$. If there existed a trusted third party, the players could
surrender their inputs to him and he would perform the function evaluation
and send them the resulting output. In the absence of such a trusted third
party, one needs to devise a protocol that the players can run on their own
in order to arrive at the required output $y$. Such a protocol is considered
perfectly secure if no player can learn from his view of the protocol more
than what he would have learnt in the idealized setting where the
computation is carried out by a trusted third party. Yao [21] was the
first to propose a generic solution for this problem in the case of two
players. Other generic solutions, for the multi-party case, were later
proposed in [2,4,10].

In our problem, the inputs are the partial databases, and the required
output is the list of association rules with given support and confidence.
As the above-mentioned generic solutions rely upon a description of the
function $f$ as a Boolean circuit, they can be applied only to small inputs
and functions which are realizable by simple circuits. In more complex
settings, such as ours, other methods are required for carrying out this
computation. In such cases, some relaxations of the notion of perfect
security might be inevitable when looking for practical protocols, provided
that the excess information is deemed benign (see examples of such
protocols in e.g. [12,20,23]).

Kantarcioglu and Clifton studied that problem in [12] and devised a protocol
for its solution. The main part of the protocol is a sub-protocol for the
secure computation of the union of private subsets that are held by the
different players. (Those subsets include candidate itemsets, as we explain
below.) That is the most costly part of the protocol and its implementation
relies upon cryptographic primitives such as commutative encryption,
oblivious transfer, and hash functions. This is also the only part in the
protocol in which the players may extract from their view of the protocol
information on other databases, beyond what is implied by the final output
and their own input. While such leakage of information renders the protocol
not perfectly secure, the perimeter of the excess information is explicitly
bounded in [12] and it is argued that such information leakage is innocuous,
whence acceptable from a practical point of view.

Herein we propose an alternative protocol for the secure computation of the
union of private subsets. The proposed protocol improves upon that in [12] in
terms of simplicity and efficiency as well as privacy. In particular, our
protocol does not depend on commutative encryption and oblivious transfer
(which simplifies it significantly and contributes towards reduced
communication and computational costs). While our solution is still not
perfectly secure, it leaks excess information only to a small number of
coalitions (three), unlike the protocol of [12] that discloses information
also to some single players. In addition, we claim that the excess
information that our protocol may leak is less sensitive than the excess
information leaked by the protocol of [12].

The protocol that we propose here computes a parameterized family of func- 
tions, which we call threshold functions, in which the two extreme cases corre- 
spond to the problems of computing the union and intersection of private subsets. 
Those are in fact general-purpose protocols that can be used in other contexts 
as well. Another problem of secure multi-party computation that we solve here 
as part of our discussion is the problem of determining whether an element held 
by one player is included in a subset held by another. 

1.1 Preliminaries 

Let D be a transaction database. As in [12], we view D as a binary matrix
of N rows and L columns, where each row is a transaction over some set of
items, $A = \{a_1, \ldots, a_L\}$, and each column represents one of the items
in A. (In other words, the (i,j)th entry of D equals 1 if the ith transaction
includes the item $a_j$, and 0 otherwise.) The database D is partitioned
horizontally between M players, denoted $P_1, \ldots, P_M$. Player $P_m$ holds
the partial database $D_m$ that contains $N_m = |D_m|$ of the transactions in
D, $1 \le m \le M$. The unified database is $D = D_1 \cup \cdots \cup D_M$, and
$N = \sum_{m=1}^{M} N_m$.

An itemset X is a subset of A. Its global support, $\mathrm{supp}(X)$, is the
number of transactions in D that contain it. Its local support,
$\mathrm{supp}_m(X)$, is the number of transactions in $D_m$ that contain it.
Clearly, $\mathrm{supp}(X) = \sum_{m=1}^{M} \mathrm{supp}_m(X)$. Let s be a real
number between 0 and 1 that stands for a required threshold support. An
itemset X is called s-frequent if $\mathrm{supp}(X) \ge sN$. It is called
locally s-frequent at $D_m$ if $\mathrm{supp}_m(X) \ge sN_m$.

For each $1 \le k \le L$, let $F_s^k$ denote the set of all k-itemsets (namely,
itemsets of size k) that are s-frequent, and $F_s^{k,m}$ be the set of all
k-itemsets that are locally s-frequent at $D_m$, $1 \le m \le M$. Our main
computational goal is to find, for a given threshold support $0 \le s \le 1$,
the set of all s-frequent itemsets, $F_s := \bigcup_{k=1}^{L} F_s^k$. We may then
continue to find all (s, c)-association rules, i.e., all association rules of
support at least sN and confidence at least c. (Recall that if X and Y are
two disjoint subsets of A, the support of the corresponding association rule
$X \Rightarrow Y$ is $\mathrm{supp}(X \cup Y)$ and its confidence is
$\mathrm{supp}(X \cup Y)/\mathrm{supp}(X)$.)
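The definitions above can be illustrated with a minimal Python sketch. The toy databases `D1` and `D2`, the item names, and the helper names `supp` and `confidence` are assumptions made for illustration only; transactions are represented as sets of items rather than rows of a binary matrix.

```python
from typing import FrozenSet, List

Transaction = FrozenSet[str]

def supp(db: List[Transaction], itemset: FrozenSet[str]) -> int:
    """Number of transactions in db that contain every item of itemset."""
    return sum(1 for t in db if itemset <= t)

def is_s_frequent(db: List[Transaction], itemset: FrozenSet[str], s: float) -> bool:
    """True if the support of itemset is at least s times the database size."""
    return supp(db, itemset) >= s * len(db)

def confidence(db: List[Transaction], x: FrozenSet[str], y: FrozenSet[str]) -> float:
    """Confidence of the rule x => y; assumes supp(db, x) > 0."""
    return supp(db, x | y) / supp(db, x)

# Hypothetical horizontally partitioned database with M = 2 players.
D1 = [frozenset("ab"), frozenset("abc"), frozenset("c")]
D2 = [frozenset("ab"), frozenset("bc")]
D = D1 + D2  # the unified database

ab = frozenset("ab")
# The global support is the sum of the local supports.
assert supp(D, ab) == supp(D1, ab) + supp(D2, ab)
```

In this toy instance the itemset {a, b} is 0.5-frequent globally, and the rule {b} => {c} has confidence 0.5.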

1.2 The Fast Distributed Mining algorithm 

The protocol of [12], as well as ours, are based on the Fast Distributed Mining 
(FDM) algorithm of Cheung et al. [5] , which is an unsecured distributed version 
of the Apriori algorithm. Its main idea is that any s-frequent itemset must be 
also locally s-frequent in at least one of the sites. Hence, in order to find all 
globally s-frequent itemsets, each player reveals his locally s-frequent itemsets 
and then the players check each of them to see if they are s-frequent also globally. 
The stages of the FDM algorithm are as follows: 

(1) Initialization: It is assumed that the players have already jointly
calculated $F_s^{k-1}$. The goal is to proceed and calculate $F_s^k$.

(2) Candidate Sets Generation: Each $P_m$ generates a set of candidate
k-itemsets $B_s^{k,m}$ out of $F_s^{k-1,m} \cap F_s^{k-1}$ (the (k-1)-itemsets
that are both globally and locally frequent), using the Apriori algorithm.

(3) Local Pruning: For each $X \in B_s^{k,m}$, $P_m$ computes
$\mathrm{supp}_m(X)$ and retains only those itemsets that are locally
s-frequent. We denote this collection of itemsets by $C_s^{k,m}$.

(4) Unifying the candidate itemsets: Each player broadcasts his $C_s^{k,m}$
and then all players compute $C_s^k := \bigcup_{m=1}^{M} C_s^{k,m}$.

(5) Computing local supports: All players compute the local supports of all
itemsets in $C_s^k$.

(6) Broadcast Mining Results: Each player broadcasts the local supports
that he computed. From that, everyone can compute the global support of
every itemset in $C_s^k$. Finally, $F_s^k$ is the subset of $C_s^k$ that
consists of all globally s-frequent k-itemsets.
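The six stages can be sketched in Python as a single, unsecured FDM round run by one process on behalf of all players. This is a simplified illustration, not the authors' implementation: the function names, the toy databases, and the brute-force candidate generation (which stands in for the join-and-prune step of Apriori) are all assumptions.

```python
from itertools import combinations

def supp(db, x):
    """Number of transactions in db that contain itemset x."""
    return sum(1 for t in db if x <= t)

def apriori_gen(frequent_prev, k):
    """Candidate k-itemsets all of whose (k-1)-subsets are in frequent_prev.
    Brute-force stand-in for Apriori's candidate generation."""
    items = sorted(set().union(*frequent_prev)) if frequent_prev else []
    return {
        frozenset(c)
        for c in combinations(items, k)
        if all(frozenset(sub) in frequent_prev for sub in combinations(c, k - 1))
    }

def fdm_round(databases, f_prev_global, f_prev_local, k, s):
    """One unsecured FDM iteration: returns the globally s-frequent k-itemsets."""
    local_candidates = []
    for db, f_loc in zip(databases, f_prev_local):
        # Stage 2: candidates from itemsets frequent both locally and globally.
        b = apriori_gen(f_loc & f_prev_global, k)
        # Stage 3: local pruning.
        local_candidates.append({x for x in b if supp(db, x) >= s * len(db)})
    # Stage 4: union of the broadcast local candidate sets.
    c = set().union(*local_candidates)
    # Stages 5-6: sum the local supports and keep the globally frequent itemsets.
    n = sum(len(db) for db in databases)
    return {x for x in c if sum(supp(db, x) for db in databases) >= s * n}

# Hypothetical example with M = 2 players and threshold s = 0.5.
D1 = [frozenset("ab"), frozenset("abc"), frozenset("c")]
D2 = [frozenset("ab"), frozenset("bc")]
s = 0.5
items = {frozenset(x) for x in "abc"}
f1_global = {x for x in items if supp(D1 + D2, x) >= s * len(D1 + D2)}
f1_local = [{x for x in items if supp(db, x) >= s * len(db)} for db in (D1, D2)]
f2 = fdm_round([D1, D2], f1_global, f1_local, k=2, s=s)
```

Stage 4 (the union) and Stage 6 (the broadcast of local supports) are exactly the points at which this plain version exposes private information.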

1.3 Overview and organization of the paper

The FDM protocol violates privacy in two stages: in Stage 4, where the
players broadcast the itemsets that are locally frequent in their private
databases, and in Stage 6, where they broadcast the sizes of the local
supports of candidate itemsets. Kantarcioglu and Clifton [12] proposed secure
implementations of those two stages. Our improvement is with regard to the
secure implementation of Stage 4, which is the more costly stage of the
protocol, and the one in which the protocol of [12] leaks excess information.
In Section 2 we describe [12]'s secure implementation of Stage 4. We then
describe our alternative implementation and we proceed to compare the two
implementations in terms of privacy and efficiency. In the next two short
Sections 3 and 4 we describe briefly, for the sake of completeness, [12]'s
implementation of the two remaining stages of the distributed protocol: the
identification of those candidate itemsets that are globally s-frequent, and
then the derivation of all (s, c)-association rules. Section 5 includes a
review of related work. We conclude the paper in Section 6.

Like in [12], we assume that the players are semi-honest; namely, they follow
the protocol but try to extract as much information as possible from their own
view. (See [11,18,23] for a discussion and justification of that assumption.)
We too, like [12], assume that M > 2. (The case M = 2 is discussed in
[12, Section 5]; the conclusion is that the problem of secure computation of
frequent itemsets and association rules in the two-party case is unlikely to
be of use.)

2 Secure computation of all locally frequent itemsets 

Here we discuss the secure computation of the union
$C_s^k = \bigcup_{m=1}^{M} C_s^{k,m}$. We describe the protocol of [12]
(Section 2.1) and then our protocol (Sections 2.2-2.3). We analyze the privacy
of the two protocols in Section 2.4, their communication cost in Section 2.5
and their computational cost in Section 2.6.

2.1 The protocol of [12] for the secure computation of all locally
frequent itemsets

Protocol 1 (UniFI-KC hereinafter) is the protocol that was suggested by
Kantarcioglu and Clifton [12] for computing the unified list of all locally
frequent itemsets, $C_s^k = \bigcup_{m=1}^{M} C_s^{k,m}$, without disclosing the
sizes of the subsets $C_s^{k,m}$ nor their contents. It is based on two ideas:
hiding the sizes of the subsets $C_s^{k,m}$ by means of fake itemsets, and
hiding their content by means of encryption. Let $Ap(F_s^{k-1})$ denote the set
of k-itemsets which the Apriori algorithm generates when applied on
$F_s^{k-1}$. Clearly, $C_s^{k,m} \subseteq Ap(F_s^{k-1})$, whence
$|C_s^{k,m}| \le |Ap(F_s^{k-1})|$, for all $1 \le m \le M$. Therefore, after each
player computes $C_s^{k,m}$, he adds to it fake itemsets until its size becomes
$|Ap(F_s^{k-1})|$ in order to hide the number of locally frequent itemsets that
he has. In order to hide the actual itemsets, they use a commutative
encryption algorithm. (A commutative encryption means that
$E_{K_1} \circ E_{K_2} = E_{K_2} \circ E_{K_1}$ for any pair of keys $K_1$ and
$K_2$.) We proceed to describe the protocol. Since all protocols that we present
here involve cyclic communication rounds, $P_{M+1}$ always means $P_1$, while
$P_0$ means $P_M$.

Protocol 1 (UniFI-KC) Unifying lists of locally Frequent Itemsets [12]

Input: Each player $P_m$ has an input set $C_s^{k,m} \subseteq Ap(F_s^{k-1})$, $1 \le m \le M$.
Output: $C_s^k = \bigcup_{m=1}^{M} C_s^{k,m}$.
1: Phase 0: Getting started
2: The players decide on a commutative cipher and each player $P_m$, $1 \le m \le M$, selects a random secret encryption key $K_m$.
3: The players select a hash function h and compute h(x) for all $x \in Ap(F_s^{k-1})$.
4: If there exist $x_1 \ne x_2 \in Ap(F_s^{k-1})$ for which $h(x_1) = h(x_2)$, select a different h.
5: Build a lookup table $T = \{(x, h(x)) : x \in Ap(F_s^{k-1})\}$.
6: Phase 1: Encryption of all itemsets
7: for all Player $P_m$, $1 \le m \le M$, do
8:   Set $X_m = \emptyset$.
9:   for all $x \in C_s^{k,m}$ do
10:     Player $P_m$ computes $E_{K_m}(h(x))$ and adds it to $X_m$.
11:   end for
12:   Player $P_m$ adds to $X_m$ faked itemsets until its size becomes $|Ap(F_s^{k-1})|$.
13: end for
14: for i = 2 to M do
15:   $P_m$ sends a permutation of $X_m$ to $P_{m+1}$.
16:   $P_m$ receives from $P_{m-1}$ the permuted $X_{m-1}$.
17:   $P_m$ computes a new $X_m$ as the encryption of the permuted $X_{m-1}$ using $K_m$.
18: end for
19: Phase 2: Merging itemsets
20: Each odd player sends his encrypted sets to player $P_1$.
21: Each even player sends his encrypted sets to player $P_2$.
22: $P_1$ unifies all sets that were sent by the odd players and removes duplicates.
23: $P_2$ unifies all sets that were sent by the even players and removes duplicates.
24: $P_2$ sends his permuted list of itemsets to $P_1$.
25: $P_1$ unifies his list of itemsets and the list received from $P_2$ and then removes duplicates from the unified list. Denote the final list by $EC_s^k$.
26: Phase 3: Decryption
27: for m = 1 to M - 1 do
28:   $P_m$ decrypts all itemsets in $EC_s^k$ using $K_m$.
29:   $P_m$ sends the permuted (and $K_m$-decrypted) $EC_s^k$ to $P_{m+1}$.
30: end for
31: $P_M$ decrypts all itemsets in $EC_s^k$ using $K_M$; denote the resulting set by $C_s^k$.
32: $P_M$ uses the lookup table T to remove from $C_s^k$ faked itemsets.
33: $P_M$ broadcasts $C_s^k$.

In the preliminary phase (Steps 2-5) the players select the needed
cryptographic primitives: they jointly select a commutative cipher, and each
player selects a corresponding random private key. In addition, they select a
hash function h to apply on all itemsets prior to encryption. It is essential
that h will not experience collisions on $Ap(F_s^{k-1})$ in order to make it
invertible on $Ap(F_s^{k-1})$. Hence, if such collisions occur (an event of a
very small probability), a different hash function must be selected. At the
end, the players compute a lookup table with the hash values of all candidate
itemsets in $Ap(F_s^{k-1})$, which will be used later on to find the preimage
of a given hash value.

In Phase 1, all players compute a composite encryption of the hashed sets
$C_s^{k,m}$, $1 \le m \le M$. First (Steps 7-13), each player $P_m$ hashes all
itemsets in $C_s^{k,m}$ and then encrypts them using the key $K_m$. (Hashing is
needed in order to prevent leakage of algebraic relations between itemsets,
see the Appendix.) Then, he adds to the resulting set faked itemsets until its
size becomes $|Ap(F_s^{k-1})|$. We denote the resulting set by $X_m$. Then
(Steps 14-18), the players start a loop of M - 1 cycles, where in each cycle
they perform the following operation: player $P_m$ sends a permutation of
$X_m$ to the next player $P_{m+1}$; player $P_m$ receives from $P_{m-1}$ a
permutation of the set $X_{m-1}$ and then computes a new $X_m$ as
$X_m = E_{K_m}(X_{m-1})$. At the end of this loop, $P_m$ holds an encryption of
the hashed $C_s^{k,m+1}$ using all M keys. Due to the commutative property of
the selected cipher, player $P_m$ holds the set
$\{E_M(\cdots(E_2(E_1(h(x))))\cdots) : x \in C_s^{k,m+1}\}$.

In Phase 2 (Steps 20-25), all odd players send the sets that they have to
$P_1$, who unifies them. Consequently, $P_1$ will hold all hashed and encrypted
itemsets of the even players. Similarly, all even players send the sets that
they have to $P_2$, who also unifies them; hence, $P_2$ will hold all hashed
and encrypted itemsets of the odd players. At this stage, both $P_1$ and $P_2$
remove duplicates, as those stand (with high probability) for itemsets that
were frequent in more than one site. Then, player $P_2$ permutes his list and
sends it to $P_1$, who unifies it with the list that he got. Therefore, at the
completion of this stage $P_1$ holds the union set
$C_s^k = \bigcup_{m=1}^{M} C_s^{k,m}$ hashed and then encrypted by all encryption
keys, together with some fake itemsets that were used for the sake of hiding
the sizes of the sets $C_s^{k,m}$; those fake itemsets are not needed anymore
and will be removed after decryption in the next phase.

In Phase 3 (Steps 27-33) a similar round of decryptions is initiated. At the
end, the last player who performs the last decryption uses the lookup table T
that was constructed in Step 5 in order to identify and remove the fake itemsets
and then to recover $C_s^k$. Finally, he broadcasts $C_s^k$ to all his peers.
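The commutative property that UniFI-KC relies on can be demonstrated with a toy cipher. The source does not fix a particular cipher; the sketch below assumes exponentiation modulo a prime (in the style of Pohlig-Hellman), and the modulus, the key generation and the encoding of itemsets as strings are illustrative assumptions with no claim of adequate security.

```python
import hashlib
import math
import secrets

P = 2**127 - 1  # a Mersenne prime; toy-sized modulus, assumed for illustration

def keygen() -> int:
    """Random exponent that is invertible modulo P - 1."""
    while True:
        k = secrets.randbelow(P - 3) + 2
        if math.gcd(k, P - 1) == 1:
            return k

def h(itemset) -> int:
    """Hash an itemset (a set of item names) to a value in [1, P - 1]."""
    digest = hashlib.sha256(",".join(sorted(itemset)).encode()).digest()
    return int.from_bytes(digest, "big") % (P - 1) + 1

def enc(key: int, value: int) -> int:
    return pow(value, key, P)

def dec(key: int, value: int) -> int:
    # pow with exponent -1 computes the modular inverse (Python 3.8+).
    return pow(value, pow(key, -1, P - 1), P)

k1, k2 = keygen(), keygen()
x = h({"a", "b"})
# The order of the two encryptions does not matter.
assert enc(k1, enc(k2, x)) == enc(k2, enc(k1, x))
# Decrypting in any order recovers the hash value.
assert dec(k1, dec(k2, enc(k1, enc(k2, x)))) == x
```

Because every itemset ends up encrypted under all M keys regardless of the order in which the players applied them, equal itemsets from different players yield equal ciphertexts, which is what allows duplicates to be removed in Phase 2.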

2.2 A secure multiparty protocol for computing the OR 

Protocol UniFI-KC securely computes the union of private subsets of some
publicly known ground set ($Ap(F_s^{k-1})$). Such a problem is equivalent to the
problem of computing the OR of private vectors. Indeed, if we let n denote the
size of the ground set, then the private subset that player $P_m$,
$1 \le m \le M$, holds may be described by a binary vector
$b_m \in \mathbb{Z}_2^n$, and the union of the private subsets is described by
the OR of those private vectors, $b := \bigvee_{m=1}^{M} b_m$. We present here a
protocol for computing that function which is simpler than UniFI-KC and
employs fewer cryptographic primitives.

The protocol that we present (Protocol 2) computes a wider range of
functions, which we call threshold functions.

Definition 1. Let $b_1, \ldots, b_M$ be M bits and $1 \le t \le M$ be an integer. Then

$$T_t(b_1, \ldots, b_M) = \begin{cases} 1 & \text{if } \sum_{m=1}^{M} b_m \ge t \\ 0 & \text{if } \sum_{m=1}^{M} b_m < t \end{cases} \tag{1}$$

is called the t-threshold function. Given binary vectors
$b_m = (b_m(1), \ldots, b_m(n)) \in \mathbb{Z}_2^n$, we let $T_t(b_1, \ldots, b_M)$
denote the binary vector in which the ith component equals
$T_t(b_1(i), \ldots, b_M(i))$, $1 \le i \le n$.

The OR and AND functions are the 1- and M-threshold functions, respectively;
i.e.,

$$\bigvee_{m=1}^{M} b_m = T_1(b_1, \ldots, b_M), \quad \text{and} \quad \bigwedge_{m=1}^{M} b_m = T_M(b_1, \ldots, b_M).$$

Those special cases may be used, as we show in Section 2.3, to compute in a
privacy-preserving manner unions and intersections of subsets.
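As a plain (non-private) reference for Definition 1, the component-wise threshold function can be written in a few lines of Python. The function name `threshold` and the sample vectors are assumptions for illustration.

```python
from typing import List, Sequence

def threshold(t: int, vectors: Sequence[Sequence[int]]) -> List[int]:
    """Component-wise t-threshold function T_t of M binary vectors."""
    return [1 if sum(column) >= t else 0 for column in zip(*vectors)]

# M = 3 hypothetical input vectors of length n = 3.
vectors = [[1, 0, 0],
           [1, 1, 0],
           [1, 0, 0]]

union = threshold(1, vectors)                    # OR, i.e. T_1
intersection = threshold(len(vectors), vectors)  # AND, i.e. T_M
```

Here `union` marks every position set by at least one player, while `intersection` marks only the positions set by all of them.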

The main idea behind Protocol 2 (Threshold henceforth), which is based
on the protocol suggested in [5] for secure computation of the sum, is to
compute shares of the sum vector and then use those shares to securely verify
the threshold conditions in each component. Since the sum vector may be seen
as a vector over $\mathbb{Z}_{M+1}$, each player starts by creating random
shares in $\mathbb{Z}_{M+1}^n$ of his input vector (Step 1). In Step 2 all
players send to all other players the corresponding shares in their input
vector. Then (Step 3), player $P_\ell$, $1 \le \ell \le M$, adds the shares that
he got and arrives at his share, $s_\ell$, in the sum vector. Namely, if we let
$a := \sum_{m=1}^{M} b_m$ denote the sum of the input vectors, then
$a = \sum_{\ell=1}^{M} s_\ell \bmod (M+1)$; furthermore, any M - 1 vectors out of
$\{s_1, \ldots, s_M\}$ do not reveal any information on the sum a. In Steps 4-5,
all players, apart from the last one, send their shares to $P_1$ who adds them
up to get the share s. Now, players $P_1$ and $P_M$ hold additive shares of the
sum vector a: $P_1$ has s, $P_M$ has $s_M$, and $a = (s + s_M) \bmod (M+1)$. It
is now needed to check for each component $1 \le i \le n$ whether
$(s(i) + s_M(i)) \bmod (M+1) < t$. Equivalently, we need to check whether

$$(s(i) + s_M(i)) \bmod (M+1) \in \{j : 0 \le j \le t-1\}. \tag{2}$$

The inclusion in Eq. (2) is equivalent to

$$s(i) \in \Theta(i) := \{(j - s_M(i)) \bmod (M+1) : 0 \le j \le t-1\}. \tag{3}$$

The value of s(i) is known only to $P_1$ while the set $\Theta(i)$ is known only
to $P_M$. The problem of verifying the set inclusion in Eq. (3) can be seen as a
simplified version of the privacy-preserving keyword search, which was solved
by Freedman et al. [15]. In the case of the OR function, t = 1, which is the
relevant case for us, the set $\Theta(i)$ is of size 1, and therefore it is the
problem of oblivious string comparison, a problem that was solved in e.g. [5].
However, we claim that, since M > 2, there is no need to invoke either of the
secure protocols of [15] or [9]. Indeed, as M > 2, the existence of other
semi-honest players can be used to verify the inclusion in Eq. (3) much more
easily. This is done in Protocol 3 (SetInc) which we proceed to describe next.

Protocol 2 (Threshold) Secure computation of the t-threshold function

Input: Each player $P_m$ has an input binary vector $b_m \in \mathbb{Z}_2^n$, $1 \le m \le M$.
Output: $b := T_t(b_1, \ldots, b_M)$.
1: Each $P_m$ selects M random share vectors $b_{m,\ell} \in \mathbb{Z}_{M+1}^n$, $1 \le \ell \le M$, such that $\sum_{\ell=1}^{M} b_{m,\ell} = b_m \bmod (M+1)$.
2: Each $P_m$ sends $b_{m,\ell}$ to $P_\ell$ for all $1 \le \ell \ne m \le M$.
3: Each $P_\ell$ computes $s_\ell = (s_\ell(1), \ldots, s_\ell(n)) := \sum_{m=1}^{M} b_{m,\ell} \bmod (M+1)$.
4: Players $P_\ell$, $2 \le \ell \le M-1$, send $s_\ell$ to $P_1$.
5: $P_1$ computes $s = (s(1), \ldots, s(n)) := \sum_{\ell=1}^{M-1} s_\ell \bmod (M+1)$.
6: for i = 1, ..., n do
7:   If $(s(i) + s_M(i)) \bmod (M+1) < t$ set b(i) = 0, otherwise set b(i) = 1.
8: end for
9: Output b = (b(1), ..., b(n)).
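The share arithmetic of Protocol Threshold can be checked with a single-process simulation. This is a sketch under stated assumptions: all players run inside one function, no messages are actually exchanged, and Steps 6-8 are evaluated in the clear rather than through SetInc, so the simulation illustrates correctness only and offers no privacy. The names `share` and `threshold_protocol` are hypothetical.

```python
import secrets
from typing import List

def share(vec: List[int], M: int) -> List[List[int]]:
    """Step 1: split vec into M random additive shares modulo M + 1."""
    q = M + 1
    shares = [[secrets.randbelow(q) for _ in vec] for _ in range(M - 1)]
    last = [(v - sum(col)) % q for v, col in zip(vec, zip(*shares))]
    return shares + [last]

def threshold_protocol(inputs: List[List[int]], t: int) -> List[int]:
    M, n = len(inputs), len(inputs[0])
    q = M + 1
    # Steps 1-2: b[m][l] is the share that player m hands to player l.
    b = [share(vec, M) for vec in inputs]
    # Step 3: player l adds up the shares he received.
    s_l = [[sum(b[m][l][i] for m in range(M)) % q for i in range(n)]
           for l in range(M)]
    # Steps 4-5: the first player aggregates the shares of all but the last.
    s = [sum(s_l[l][i] for l in range(M - 1)) % q for i in range(n)]
    s_M = s_l[M - 1]
    # Steps 6-8, done in the clear here (the paper delegates this to SetInc).
    return [0 if (s[i] + s_M[i]) % q < t else 1 for i in range(n)]

inputs = [[1, 0, 0], [1, 1, 0], [1, 0, 0]]  # hypothetical vectors, M = 3
or_vector = threshold_protocol(inputs, t=1)
and_vector = threshold_protocol(inputs, t=3)
```

Working modulo M + 1 is what makes the reconstruction exact: each component of the sum lies between 0 and M, so no wrap-around can occur.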

Protocol SetInc starts with players $P_1$ and $P_M$ agreeing on a keyed hash
function $h_K(\cdot)$ (e.g., HMAC [3]), a corresponding secret key K, and a long
random bit string r (Steps 1-2). Then (Steps 3-4), $P_1$ converts his sequence
of elements $s = (s(1), \ldots, s(n))$ into a sequence of corresponding
"signatures" $s' = (s'(1), \ldots, s'(n))$, where $s'(i) = h_K(r, i, s(i))$, and
$P_M$ applies a similar conversion to the subsets that he holds. $P_1$ then sends
s' to $P_2$ and $P_M$ sends to $P_2$ a random permutation of each of the subsets
$\Theta'(i)$, $1 \le i \le n$. Finally, $P_2$ performs the relevant inclusion
verifications on the signature values. If he finds out that for a given
$1 \le i \le n$, $s'(i) \in \Theta'(i)$, he may infer, with high probability,
that $s(i) \in \Theta(i)$, whence he sets b(i) = 0. If, on the other hand,
$s'(i) \notin \Theta'(i)$, then, with certainty, $s(i) \notin \Theta(i)$, whence
he sets b(i) = 1.

Two comments are in order:

(1) If the index i had not been part of the input to the hash function (Steps
3-4), then two equal components in $P_1$'s input vector, say s(i) = s(j), would
have been mapped to two equal signatures, s'(i) = s'(j). Hence, in that
case player $P_2$ would have learnt that in $P_1$'s input vector the ith and jth
components are equal. To prevent such leakage of information, we include
the index i in the input to the hash function.

(2) An event in which $s'(i) \in \Theta'(i)$ while $s(i) \notin \Theta(i)$
indicates a collision; specifically, it implies that there exist
$\theta' \in \Theta(i)$ and $\theta'' \in \Omega \setminus \Theta(i)$ for which
$h_K(r, i, \theta') = h_K(r, i, \theta'')$. Hash functions are designed so that
the probability of such collisions is negligible, whence the risk of a
collision can be ignored. However, it is possible for player $P_M$ to check
upfront the selected random values (K and r) in order to verify that for all
$1 \le i \le n$, the sets
$\Theta'(i) = \{h_K(r, i, \theta) : \theta \in \Theta(i)\}$ and
$\Theta''(i) = \{h_K(r, i, \theta) : \theta \in \Omega \setminus \Theta(i)\}$ are
disjoint.



Mining of Association Rules in Horizontally Distributed Databases 






Protocol 3 (SetInc) Set Inclusion computation

Input: $P_1$ has a vector $s = (s(1), \ldots, s(n))$ and $P_M$ has a vector $\Theta = (\Theta(1), \ldots, \Theta(n))$, where for all $1 \le i \le n$, $s(i) \in \Omega$ and $\Theta(i) \subseteq \Omega$ for some ground set $\Omega$.
Output: The vector b = (b(1), ..., b(n)) where b(i) = 0 if $s(i) \in \Theta(i)$ and b(i) = 1 otherwise, $1 \le i \le n$.
1: $P_1$ and $P_M$ agree on a keyed-hash function $h_K(\cdot)$ and on a secret key K.
2: $P_1$ and $P_M$ agree on a long random bit string r.
3: $P_1$ computes $s' = (s'(1), \ldots, s'(n))$, where $s'(i) = h_K(r, i, s(i))$, $1 \le i \le n$.
4: $P_M$ computes $\Theta' = (\Theta'(1), \ldots, \Theta'(n))$, where $\Theta'(i) = \{h_K(r, i, \theta) : \theta \in \Theta(i)\}$, $1 \le i \le n$.
5: $P_1$ sends to $P_2$ the vector s'.
6: $P_M$ sends to $P_2$ the vector $\Theta'$ in which each $\Theta'(i)$ is randomly permuted.
7: For all $1 \le i \le n$, $P_2$ sets b(i) = 0 if $s'(i) \in \Theta'(i)$ and b(i) = 1 otherwise.
8: $P_2$ broadcasts the vector b = (b(1), ..., b(n)).
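The steps of SetInc can be sketched with Python's standard `hmac` module, taking HMAC-SHA256 as the keyed hash (the text names HMAC only as an example). The three roles are simulated inside one function, so this illustrates the data flow rather than a deployment; the byte encoding of (r, i, value), the 0-based indices, and the use of `random.shuffle` for the permutation are assumptions made for brevity.

```python
import hashlib
import hmac
import random
import secrets
from typing import List, Set

def sig(key: bytes, r: bytes, i: int, value: int) -> bytes:
    """Signature h_K(r, i, value); the index i is part of the hashed input."""
    msg = r + i.to_bytes(4, "big") + value.to_bytes(4, "big")
    return hmac.new(key, msg, hashlib.sha256).digest()

def set_inc(s: List[int], theta: List[Set[int]]) -> List[int]:
    # Steps 1-2: the first and last players agree on a key and a random string.
    key, r = secrets.token_bytes(32), secrets.token_bytes(32)
    # Step 3: the first player signs each of his elements.
    s_sig = [sig(key, r, i, v) for i, v in enumerate(s)]
    # Steps 4 and 6: the last player signs and shuffles each of his subsets.
    theta_sig = []
    for i, subset in enumerate(theta):
        signed = [sig(key, r, i, v) for v in subset]
        random.shuffle(signed)
        theta_sig.append(signed)
    # Step 7: the second player sees only signatures and tests inclusion.
    return [0 if s_sig[i] in theta_sig[i] else 1 for i in range(len(s))]

# Hypothetical inputs over the ground set {0, 1, 2, 3}.
b = set_inc([2, 0, 3], [{2}, {1}, {3, 0}])
```

Since the index is hashed together with the value, equal entries at different positions of s produce different signatures, in line with comment (1) above.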



We refer hereinafter to the combination of Protocols Threshold and SetInc
as Protocol Threshold-C; namely, it is Protocol Threshold where the
inequality verifications in Steps 6-8 are carried out by Protocol SetInc. Then
our claims are as follows:

Theorem 1. Assume that the M > 2 players are semi-honest and that the keyed
hash function in Protocol SetInc is preimage-resistant. Then:

(a) Protocol Threshold-C is correct (i.e., it computes the threshold function).

(b) Let $C \subset \{P_1, P_2, \ldots, P_M\}$ be a coalition of players.

(i) If $P_2 \notin C$ and at least one of $P_1$ and $P_M$ is not in C either, then
Protocol Threshold-C is perfectly private with respect to C.

(ii) If $P_2 \in C$ but $P_1, P_M \notin C$, the protocol is computationally
private with respect to C.

(iii) Otherwise, C includes at least two of the players $P_1, P_2, P_M$; such
coalitions may learn the sum $a = \sum_{m=1}^{M} b_m$, but no further information
beyond the sum.

Proof.

(a) Protocol Threshold operates correctly if the inequality verifications in
Step 7 are carried out correctly, since $(s(i) + s_M(i)) \bmod (M+1)$ equals the
ith component a(i) in the sum vector $a = \sum_{m=1}^{M} b_m$. The inequality
verification is correct if Protocol SetInc is correct. The latter protocol is
indeed correct if the randomly selected K and r are such that for all
$1 \le i \le n$, the sets
$\Theta'(i) = \{h_K(r, i, \theta) : \theta \in \Theta(i)\}$ and
$\Theta''(i) = \{h_K(r, i, \theta) : \theta \in \Omega \setminus \Theta(i)\}$ are
disjoint. (As discussed earlier, such a verification can be carried out upfront,
and almost all selections of K and r are expected to pass that test.)

(b) Any single player $P_\ell$, $1 \le \ell \le M$, learns in the course of the
protocol his share of the sum a in an M-out-of-M secret sharing scheme for a
(see Step 3 in Protocol Threshold). Two players learn more information: $P_1$
receives the shares $s_2, \ldots, s_{M-1}$ (Step 4 in Protocol Threshold) and
$P_2$ receives the signatures s' and $\Theta'$ during Protocol SetInc.

(i) If $P_2, P_1 \notin C$ then the players in C have, at most, the shares
$s_3, \ldots, s_M$. Since the secret sharing scheme is perfect, any number of
shares which is less than M reveals no information on a, since the missing
shares were chosen at random. If $P_2, P_M \notin C$, then the worst scenario is
that in which $P_1 \in C$; in that case, the coalition knows the shares
$s_1, \ldots, s_{M-1}$. Once again, as the missing share $s_M$ was chosen at
random by $P_M$, the shares $s_1, \ldots, s_{M-1}$ reveal no information on a.

(ii) If $P_2 \in C$ and $P_1, P_M \notin C$, then C has at most the M - 2 additive
shares $s_2, \ldots, s_{M-1}$. The additional knowledge that $P_2$ holds enables,
in theory, the recovery of the missing shares $s_1$ and $s_M$ and then the
recovery of $a = \sum_{\ell=1}^{M} s_\ell$. Indeed, by scanning all possible keys
K of the keyed hash function and all possible random strings r, $P_2$ may find
a key K and a string r for which the signature values that he got from $P_1$
and $P_M$ (namely, s' and $\Theta'$) are consistent with the signature scheme
and the elements of $\Omega = \{0, 1, \ldots, M\}$. Hence, the protocol does not
provide perfect privacy in the information-theoretic sense with respect to such
coalitions. However, since such a computation is infeasible, and as the hash
function is preimage-resistant, the protocol provides computational privacy
with respect to such coalitions.

(iii) If $P_1, P_M \in C$, then by adding s (known to $P_1$) and $s_M$ (known to
$P_M$), they will get the sum a. No further information on the input vectors
$b_1, \ldots, b_M$ may be deduced from the inputs of the players in such a
coalition; specifically, every set of vectors $b_1, \ldots, b_M$ that is
consistent with the sum a is equally likely. Coalitions C that include either
$P_1, P_2$ or $P_2, P_M$ can also recover a. Indeed, $P_2$ knows s' and
$\Theta'$ and $P_1$ or $P_M$ knows $h_K$, K and r. Hence, a coalition of $P_2$
with either $P_1$ or $P_M$ may recover from those values the preimages s and
$\Theta$. Hence, such a coalition can recover s and $s_M$, and consequently a.
As argued before, the shares available for such coalitions do not reveal any
further information about the input vectors $b_1, \ldots, b_M$.

□ 

The susceptibility of Protocol Threshold-C to coalitions is not very
significant for two reasons:



- The entries of the sum vector a do not reveal information about specific input
vectors. Namely, knowing that a(i) = p only indicates that p out of the M
bits $b_m(i)$, $1 \le m \le M$, equal 1, but it reveals no information regarding
which of the M bits are those.

- There are only three players that can collude in order to learn information
beyond the intention of the protocol. Such a situation is far less severe than
a situation in which any player may participate in a coalition, since if it is
revealed that a collusion took place, there is a small set of suspects.

2.3 An improved protocol for the secure computation of all locally
frequent itemsets

As before, we denote by $F_s^{k-1}$ the set of all globally frequent
(k-1)-itemsets, and by $Ap(F_s^{k-1})$ the set of k-itemsets that the Apriori
algorithm generates when applied on $F_s^{k-1}$. All players can compute that set
and decide on an ordering of it. (Since all itemsets are subsets of
$A = \{a_1, \ldots, a_L\}$, they may be viewed as binary vectors in $\{0,1\}^L$
and, as such, they may be ordered lexicographically.) Then, since the sets of
locally frequent k-itemsets, $C_s^{k,m}$, $1 \le m \le M$, are subsets of
$Ap(F_s^{k-1})$, they may be encoded as binary vectors of length
$n_k := |Ap(F_s^{k-1})|$. The binary vector that encodes the union
$C_s^k := \bigcup_{m=1}^{M} C_s^{k,m}$ is the OR of the vectors that encode the
sets $C_s^{k,m}$, $1 \le m \le M$. Hence, the players can compute the union by
invoking Protocol Threshold-C on their binary input vectors. This approach is
summarized in Protocol 4 (UniFI). (Replacing $T_1$ with $T_M$ in Step 2 will
result in computing the intersection of the private subsets.)



Protocol 4 (UniFI) Unifying lists of locally Frequent Itemsets

Input: Each player $P_m$ has an input subset $C_s^{k,m} \subseteq Ap(F_s^{k-1})$, $1 \le m \le M$.
Output: $C_s^k = \bigcup_{m=1}^{M} C_s^{k,m}$.
1: Each player $P_m$ encodes his subset $C_s^{k,m}$ as a binary vector $b_m$ of length $n_k = |Ap(F_s^{k-1})|$, in accord with the agreed ordering of $Ap(F_s^{k-1})$.
2: The players invoke Protocol Threshold-C to compute $b = T_1(b_1, \ldots, b_M) = \bigvee_{m=1}^{M} b_m$.
3: $C_s^k$ is the subset of $Ap(F_s^{k-1})$ that is described by b.
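The encoding and decoding steps of Protocol UniFI (Steps 1 and 3) can be sketched as follows. The ground set, the local subsets and the helper names are hypothetical, the ordering by sorted item names is just one possible agreed ordering, and Step 2 is replaced by a plain OR computed in the clear, standing in for the secure Threshold-C invocation.

```python
from typing import FrozenSet, List, Set

Itemset = FrozenSet[str]

def encode(subset: Set[Itemset], ground: List[Itemset]) -> List[int]:
    """Step 1: binary vector of a subset under the agreed ordering."""
    return [1 if x in subset else 0 for x in ground]

def decode(bits: List[int], ground: List[Itemset]) -> Set[Itemset]:
    """Step 3: the subset of the ground set described by a binary vector."""
    return {x for x, bit in zip(ground, bits) if bit}

def or_in_the_clear(vectors: List[List[int]]) -> List[int]:
    """Stand-in for Step 2; the protocol would run Threshold-C with t = 1."""
    return [1 if any(col) else 0 for col in zip(*vectors)]

# Hypothetical candidate set and an agreed ordering of it.
ground = sorted([frozenset("bc"), frozenset("ab"), frozenset("ac")], key=sorted)

# Hypothetical locally frequent subsets of M = 3 players.
local_sets = [{frozenset("ab")}, {frozenset("ab"), frozenset("bc")}, set()]

vectors = [encode(c, ground) for c in local_sets]
unified = decode(or_in_the_clear(vectors), ground)
```

The vector length depends only on the public candidate set, so the encoding by itself hides how many itemsets each player holds; no fake itemsets are needed for that purpose.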



2.4 Privacy 

We begin by analyzing the privacy offered by Protocol UniFI-KC. That protocol
does not respect perfect privacy since it reveals to the players information
that is not implied by their own input and the final output. In Step 12 of
Phase 1 of the protocol, each player augments the set $X_m$ by fake itemsets. To
avoid unnecessary hash and encryption computations, those fake itemsets are
random strings in the ciphertext domain of the chosen commutative cipher. The
probability of two players selecting random strings that will become equal at
the end of Phase 1 is negligible; so is the probability of player $P_m$
selecting a random string that equals $E_{K_m}(h(x))$ for a true itemset
$x \in Ap(F_s^{k-1})$. Hence, every encrypted itemset that appears in two
different lists indicates with high probability a true itemset that is locally
s-frequent in both of the corresponding sites. Therefore, Protocol UniFI-KC
reveals the following excess information:

(1) $P_1$ may deduce for any subset of the even players, the number of itemsets
that are locally supported by all of them.

(2) $P_2$ may deduce for any subset of the odd players, the number of itemsets
that are locally supported by all of them.



12 T. Tassa 



(3) Pi may deduce the number of itemsets that are supported by at least one 
odd player and at least one even player. 

(4) If Pi and P 2 collude, they reveal for any subset of the players the number 
of itemsets that are locally supported by all of them. 

As for the privacy offered by Protocol UniFI, we consider two cases: If there
are no collusions, then, by Theorem 1, Protocol UniFI offers perfect privacy
with respect to all players $P_m$, $m \ne 2$, and computational privacy with respect
to $P_2$. This is a better privacy guarantee than that offered by Protocol
UniFI-KC, since the latter protocol does reveal information to $P_1$ and $P_2$
even if they do not collude with any other player.

If there are collusions, both Protocols UniFI-KC and UniFI allow the colluding
parties to learn forbidden information. In both cases, the number of "suspects"
is small: in Protocol UniFI-KC only $P_1$ and $P_2$ may benefit from a
collusion, while in Protocol UniFI only $P_1$, $P_2$ and $P_M$ can extract additional
information if two of them collude (see Theorem 1). In Protocol UniFI-KC, the
excess information which may be extracted by $P_1$ and $P_2$ is about the number of
common frequent itemsets among any subset of the players. Namely, they may
learn that, say, $P_2$ and $P_3$ have many itemsets that are frequent in both of their
databases (but not which itemsets), while $P_2$ and $P_4$ have very few itemsets
that are frequent in their corresponding databases. The excess information in
Protocol UniFI is different: If any two out of $P_1$, $P_2$ and $P_M$ collude, they can
learn the sum of all private vectors. That sum reveals for each specific itemset in
$Ap(F_s^{k-1})$ the number of sites in which it is frequent, but not which sites. Hence,
while the colluding players in Protocol UniFI-KC can distinguish between the
different players and learn about the similarity or dissimilarity between them,
Protocol UniFI leaves the partial databases totally indistinguishable, as the
excess information that it leaks is with regard to the itemsets only.
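
To illustrate the last point, the following Python sketch (with hypothetical toy vectors and a made-up function name) computes the entry-wise sum of the private binary vectors, which is the excess information described above for two colluding players among $P_1$, $P_2$ and $P_M$. Reassigning the vectors to different players leaves that sum unchanged, so the sum tells how many sites support each itemset but not which ones.

```python
from itertools import permutations
from typing import List, Sequence


def colluders_view(private_vectors: Sequence[Sequence[int]]) -> List[int]:
    """Entry-wise sum of all private binary vectors."""
    return [sum(column) for column in zip(*private_vectors)]


# Hypothetical binary vectors of three players over four candidate itemsets.
vectors = [[1, 1, 0, 0], [0, 1, 0, 0], [0, 1, 0, 1]]

view = colluders_view(vectors)

# The same sum is obtained for every assignment of the vectors to players.
views_under_relabeling = {
    tuple(colluders_view(p)) for p in permutations(vectors)
}
```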

To summarize, given that Protocol UniFI reveals no excess information when 
there are no collusions, and, in addition, when there are collusions, the excess 
information still leaves the partial databases indistinguishable, it offers enhanced 
privacy preservation in comparison to Protocol UniFI-KC. 

2.5 Communication cost 

Here and in the next section we analyze the communication and computational
costs of Protocols UniFI-KC and UniFI. In doing so, we let $n_k := |Ap(F_s^{k-1})|$
and $n := \sum_{k=2}^{L} n_k$; also, the kth iteration refers to the iteration in which $F_s^k$ is
computed from $F_s^{k-1}$ ($2 \le k \le L$).

We start with Protocol UniFI-KC. Let t denote the number of bits required
to represent an itemset. Clearly, t must be at least $\log_2 n_k$ for all $2 \le k \le L$.
However, as Protocol UniFI-KC hashes the itemsets and then encrypts them, t
should be at least the recommended ciphertext length in commutative ciphers.
RSA [17], Pohlig-Hellman [16] and ElGamal [7] ciphers are examples of
commutative encryption schemes. As the recommended length of the modulus in all
of them is at least 1024 bits, we take t = 1024.

During Phase 1 of Protocol UniFI-KC, there are M - 1 rounds of communication,
where in each one of them each of the M players sends to the next player
a message; the length of that message in the kth iteration is $t n_k$. Hence, the
communication cost of this phase in the kth iteration is $(M-1) M t n_k$. During
Phase 2 of the protocol all odd players send their encrypted itemsets to $P_1$ and
all even players send their encrypted itemsets to $P_2$. Then $P_2$ unifies the
itemsets he got and sends them to $P_1$. Hence, this phase takes 2 more rounds and
its communication cost in the kth iteration is bounded by $1.5 M t n_k$. In the last
phase a similar round of decryptions is initiated. The unified list of all encrypted
true and fake itemsets may contain in the kth iteration up to $M n_k$ itemsets.
Hence, that phase involves M - 1 rounds with communication cost of no more
than $(M-1) M t n_k$. To summarize: Protocol UniFI-KC entails 2M communication
rounds (in each of the iterations) and the communication cost in the kth
iteration is $O(M^2 t n_k)$. (In fact, since the majority of the itemsets are expected
to be fake itemsets, the communication cost in the decryption phase is close to
$(M-1) M t n_k$ and then the overall communication cost is roughly $2 M^2 t n_k$.)

We now turn to analyze the communication cost of Protocol UniFI. It con- 
sists of 3 communication rounds: One for Step 2 of Protocol Threshold that it 
invokes, one for Step 4 of that protocol, and one for the threshold verifications 
in Steps 6-8 (in which Protocol SetInc is invoked). Hence, it improves upon 
Protocol UniFI-KC that entails 2M communication rounds. 

In the kth iteration, the length of the vectors in Protocol Threshold is $n_k$;
each entry in those vectors represents a number between 0 and M - 1, whence
it may be encoded by $\log_2 M$ bits. Therefore:

- The communication cost of Step 2 in Protocol Threshold is
$M(M-1)(\log_2 M) n_k$ bits.

- The communication cost of Step 4 in Protocol Threshold is $(M-2)(\log_2 M) n_k$
bits.

- Steps 6-8 in Protocol Threshold are carried out by invoking Protocol SetInc.
The communication cost of Steps 5-6 in the latter protocol is $2|h| n_k$,
where |h| is the size in bits of the hash function's output. (Recall that when
Protocol SetInc is called in the framework of Protocol Threshold-C, the
size of each of the sets $\Theta(i)$ is 1.)

Hence, the overall communication cost of Protocol Threshold-C in the kth
iteration is $((M^2 - 2)(\log_2 M) + 2|h|) n_k$ bits.

As discussed earlier, a plausible setting of t would be t = 1024. A typical
value of |h| is 160. With those parameters, the improvement factor in terms
of communication cost, as offered by Protocol UniFI with respect to Protocol
UniFI-KC, is approximately

$$\frac{2 M^2 t n_k}{(M^2 \log_2 M + 2|h|) n_k} = \left( \frac{\log_2 M}{2t} + \frac{|h|}{M^2 t} \right)^{-1} = \left( \frac{\log_2 M}{2048} + \frac{0.15625}{M^2} \right)^{-1} .$$

For M = 4 we get an improvement factor of roughly 93, while for M = 8 we get
a factor of about 256.
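
The arithmetic behind these factors can be reproduced with the short Python sketch below. The function names are hypothetical; the formulas are the per-iteration cost estimates stated above, and `improvement_factor` uses the same simplification as the displayed expression, namely $M^2$ in place of $M^2 - 2$.

```python
from math import log2


def cost_unifi_kc(M: int, t: int, n_k: int) -> float:
    """Rough bits sent per iteration by UniFI-KC: 2 * M^2 * t * n_k."""
    return 2 * M**2 * t * n_k


def cost_unifi(M: int, h_bits: int, n_k: int) -> float:
    """Bits sent per iteration by UniFI: ((M^2 - 2) log2(M) + 2|h|) * n_k."""
    return ((M**2 - 2) * log2(M) + 2 * h_bits) * n_k


def improvement_factor(M: int, t: int = 1024, h_bits: int = 160) -> float:
    """Approximate ratio from the text, with M^2 - 2 replaced by M^2."""
    return 1.0 / (log2(M) / (2 * t) + h_bits / (M**2 * t))


factor_4 = improvement_factor(4)
factor_8 = improvement_factor(8)
# Ratio without the M^2 simplification, for comparison (n_k cancels out).
exact_ratio_4 = cost_unifi_kc(4, 1024, 1) / cost_unifi(4, 160, 1)
```

Without the simplification the ratio for M = 4 comes out slightly higher (between 94 and 95), which does not change the conclusion.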

2.6 Computational cost

In Protocol UniFI-KC each of the players needs to perform hash evaluations
as well as encryptions and decryptions. As the cost of hash evaluations is
significantly smaller than the cost of commutative encryption, we focus on the
cost of the latter operations. In Steps 9-11 of the protocol, player $P_m$ performs
$|C_s^{k,m}| \le n_k = |Ap(F_s^{k-1})|$ encryptions (in the kth iteration). Then, in Steps
14-18, each player performs M - 1 encryptions of sets that include $n_k$ items.
Hence, in Phase 1 in the kth iteration, each player performs between $(M-1) n_k$
and $M n_k$ encryptions. In Phase 3, each player decrypts the set of items $EC_s^k$.
$EC_s^k$ is the union of the encrypted sets from all M players, where each of those
sets has $n_k$ items, true and fake ones. Clearly, the size of $EC_s^k$ is at least $n_k$.
On the other hand, since most of the items in the M sets are expected to be
fake ones, and the probability of collisions between fake items is negligible, it is
expected that the size of $EC_s^k$ would be close to $M n_k$. So, in all its iterations
($2 \le k \le L$), Protocol UniFI-KC requires each player to perform an overall
number of close to 2Mn (but no less than Mn) encryptions or decryptions,
where, as before, $n = \sum_{k=2}^{L} n_k$. Since commutative encryption is typically based
on modular exponentiation, the overall computational cost of the protocol is
$O(M t^3 n)$ bit operations per player.

In Protocol Threshold, which Protocol UniFI calls, each player needs to
generate (M - 1)n (pseudo)random ($\log_2 M$)-bit numbers (Step 1). Then, each
player performs (M - 1)n additions of such numbers in Step 1 as well as in Step
3. Player $P_1$ also has to perform (M - 2)n additions in Step 5. Therefore, the
computational cost for each player is $O(M n \log_2 M)$ bit operations. In addition,
Players 1 and M need to perform n hash evaluations.

Estimating the practical gain in computation time. Here, we estimate the
practical gain in computation time, as offered by Protocol UniFI in comparison
to Protocol UniFI-KC. We adopt the same estimation methodology that was
used in [12, Section 6.2]. We measured the times to perform the basic operations
used by the two protocols on an Intel(R) Core(TM)2 Quad CPU 2.67 GHz
personal computer with 8 GB of RAM:

- modular addition took (on average) 0.762 μs (microseconds);

- random byte generation took 0.0126 μs;

- equality verification between two 160-bit values took 0.13 μs;

- modular exponentiation with modulus of t = 1024 bits took 12251 μs; and

- computing HMAC on an input of 512 bits took on average 15.7 μs.

As in [12], we estimate the added computational cost due to the secure
computations in the two protocols when implemented in the experimental setting that
was used in [6]. In that experimental setting, the number of sites was M = 3, the
number of items was L = 1000 and the unified database contained N = 500,000
transactions. Using a support threshold of s = 0.01, [6] reports that
$n = \sum_{k=2}^{L} n_k$ was just over 100,000.

In Protocol UniFI-KC each player performs roughly 2Mn encryptions and
decryptions. Hence, the corresponding encryption time per player is
2 · 3 · 100,000 · 12251 μs in this setting, i.e., approximately 7350 seconds. In
comparison, Protocol UniFI requires each player to generate $(M-1) n \log_2 M$
pseudo-random bits (50,000 bytes in this setting, which take 630 μs), and to
perform (2M - 2)n additions (400,000 modular additions in this case, which take
0.305 seconds). $P_1$ has to perform (M - 2)n additional additions (0.076 seconds);
$P_1$ and $P_M$ need to perform 100,000 HMAC computations (1.57 seconds); and $P_2$
performs 100,000 verifications of equality between 160-bit values (0.013 seconds).
The overall computational overhead for the least busy player ($P_2$) is 0.318
seconds and for the busiest player ($P_1$) is 1.951 seconds. Hence, compared to 7350
seconds for each player in Protocol UniFI-KC, the improvement in computational
time overhead is overwhelming.
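
The estimate above can be re-derived from the measured unit times with the following Python sketch. The variable names are hypothetical. One assumption is needed to match the 50,000 bytes quoted in the text: each random number is taken to occupy 2 bits, i.e. $\log_2 M$ rounded up for M = 3.

```python
# Measured unit times from the text, in microseconds.
T_ADD, T_RAND_BYTE, T_EQ, T_MODEXP, T_HMAC = 0.762, 0.0126, 0.13, 12251.0, 15.7
US = 1e-6  # microseconds to seconds

M, n = 3, 100_000

# UniFI-KC: roughly 2Mn modular exponentiations per player.
kc_per_player = 2 * M * n * T_MODEXP * US

# UniFI: random generation, additions, HMACs and equality checks.
bits_per_number = 2  # assumption: ceil(log2(3)), matching 50,000 bytes
rand_bytes = (M - 1) * n * bits_per_number / 8
t_rand = rand_bytes * T_RAND_BYTE * US
t_add = (2 * M - 2) * n * T_ADD * US
t_extra_add_p1 = (M - 2) * n * T_ADD * US
t_hmac = n * T_HMAC * US
t_eq = n * T_EQ * US

least_busy = t_rand + t_add + t_eq                 # player P_2
busiest = t_rand + t_add + t_extra_add_p1 + t_hmac  # player P_1
```

Under these figures the per-player overhead of Protocol UniFI is smaller than that of Protocol UniFI-KC by more than three orders of magnitude.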

3 Identifying the globally s-frequent itemsets 

Protocol UniFI-KC yields the set $C_s^k$ that consists of all itemsets that are locally
s-frequent in at least one site. Those are the k-itemsets that have potential to be
also globally s-frequent. In order to reveal which of those itemsets are globally
s-frequent there is a need to securely compute the support of each of those itemsets.
That computation must not reveal the local support in any of the sites. Let x be
one of the candidate itemsets in $C_s^k$. Then x is globally s-frequent if and only if

$$\Delta(x) := supp(x) - sN = \sum_{m=1}^{M} \left( supp_m(x) - sN_m \right) \ge 0 . \qquad (4)$$

Kantarcioglu and Clifton considered two possible settings. If the required output
includes all globally s-frequent itemsets, as well as the sizes of their supports,
then the values of $\Delta(x)$ can be revealed for all $x \in C_s^k$. In such a case, those
values may be computed using a secure summation protocol (e.g. [5]). The more
interesting setting, however, is the one where the support sizes are not part of
the required output. We proceed to discuss it.

Let q be an integer greater than 2N. Then, since $|\Delta(x)| \le N$, the itemset
$x \in C_s^k$ is s-frequent if and only if $\Delta(x) \bmod q < q/2$. The idea is to verify that
inequality by starting an implementation of the secure summation protocol of [5]
on the private inputs $\Delta_m(x) := supp_m(x) - sN_m$, modulo q. In that protocol, all
players jointly compute random additive shares of the required sum $\Delta(x)$ and
then, by sending all shares to, say, $P_1$, he may add them and reveal the sum.
If, however, $P_M$ withholds his share of the sum, then $P_1$ will have one random
share, $s_1(x)$, of $\Delta(x)$, and $P_M$ will have a corresponding share, $s_M(x)$; namely,
$s_1(x) + s_M(x) = \Delta(x) \bmod q$. It is then proposed that the two players execute
the generic secure circuit evaluation of [21] in order to verify whether

$$(s_1(x) + s_M(x)) \bmod q < q/2 . \qquad (5)$$

Those circuit evaluations may be parallelized for all $x \in C_s^k$.
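
The role of the modulus q can be seen in the following minimal Python sketch. It works in the clear on hypothetical integer values of the sum, so it is neither the secure summation protocol of [5] nor the circuit evaluation of [21]; it only checks that two additive shares modulo q determine the sign of the shared sum when q > 2N. The function names are assumptions made for illustration.

```python
import random
from typing import Tuple


def two_shares(total: int, q: int, rng: random.Random) -> Tuple[int, int]:
    """Split `total` into two random additive shares modulo q."""
    s1 = rng.randrange(q)
    sM = (total - s1) % q
    return s1, sM


def nonnegative_from_shares(s1: int, sM: int, q: int) -> bool:
    """The test (s1 + sM) mod q < q/2, written with integer arithmetic."""
    return 2 * ((s1 + sM) % q) < q


N = 50
q = 2 * N + 1  # any integer greater than 2N would do
rng = random.Random(0)

# For every possible value of the sum in [-N, N], the test on the shares
# agrees with the sign of the sum itself.
all_agree = all(
    nonnegative_from_shares(*two_shares(delta, q, rng), q) == (delta >= 0)
    for delta in range(-N, N + 1)
)
```

Non-negative sums land in [0, N] modulo q and negative sums land in [q - N, q - 1], and the two ranges are separated by q/2 because q > 2N.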

We observe that inequality (5) holds if and only if

$$s_1(x) \in \Theta(x) := \{ (j - s_M(x)) \bmod q : 0 \le j \le \lfloor (q-1)/2 \rfloor \} . \qquad (6)$$

As $s_1(x)$ is known only to $P_1$ while $\Theta(x)$ is known only to $P_M$, the verification
of the set inclusion in (6) can theoretically be carried out by means of Protocol
SetInc. However, the ground set $\Omega$ in this case is $\mathbb{Z}_q$, which is typically a large
set. (Recall that when Protocol SetInc is invoked from UniFI, the ground set
$\Omega$ is $\mathbb{Z}_{M+1}$, which is typically a small set.) Hence, Protocol SetInc is not useful
in this case, and, consequently, Yao's generic protocol remains, for the moment,
the simplest way to securely verify inequality (5). Yao's protocol is designed for
the two-party case. In our setting, as M > 2, there exist additional semi-honest
players. (This is the assumption on which Protocol SetInc relies.) An interesting
question which arises in this context is whether the existence of such additional
semi-honest players may be used to verify inequalities like (5), even when the
modulus is large, without resorting to costly protocols such as oblivious transfer.
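
The equivalence between inequality (5) and the set inclusion (6) can be checked exhaustively for a small modulus, as in the Python sketch below. It is a plain arithmetic check with hypothetical function names, not an implementation of Protocol SetInc or of Yao's protocol.

```python
from typing import Set


def theta(sM: int, q: int) -> Set[int]:
    """The set in (6), computable from the share sM alone."""
    return {(j - sM) % q for j in range((q - 1) // 2 + 1)}


def inequality_5(s1: int, sM: int, q: int) -> bool:
    """(s1 + sM) mod q < q/2, written with integer arithmetic."""
    return 2 * ((s1 + sM) % q) < q


def equivalent_for_all_shares(q: int) -> bool:
    """Check that (5) and (6) agree for every pair of shares modulo q."""
    return all(
        (s1 in theta(sM, q)) == inequality_5(s1, sM, q)
        for s1 in range(q)
        for sM in range(q)
    )


odd_ok = equivalent_for_all_shares(11)
even_ok = equivalent_for_all_shares(10)
size_for_q_101 = len(theta(3, 101))
```

The set in (6) always holds about half of the residues modulo q, which gives a sense of why a protocol that is practical over a small ground set becomes impractical when q is large.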

4 Identifying all (s, c)-association rules 

Once all s-frequent itemsets are found ($F_s$), we may proceed to look for all
(s, c)-association rules (rules with support at least sN and confidence at least c).
For $X, Y \in F_s$, the corresponding association rule $X \Rightarrow Y$ has confidence at
least c if and only if $supp(X \cup Y)/supp(X) \ge c$, or, equivalently,

$$C_{X,Y} := \sum_{m=1}^{M} \left( supp_m(X \cup Y) - c \cdot supp_m(X) \right) \ge 0 . \qquad (7)$$

Since $|C_{X,Y}| \le N$, then by taking q = 2N + 1, the players can verify inequality
(7), in parallel, for all candidate association rules, as described in Section 3.

5 Related work 

Previous work in privacy preserving data mining has considered two related 
settings. One, in which the data owner and the data miner are two different 
entities, and another, in which the data is distributed among several parties who 
aim to jointly perform data mining on the unified corpus of data that they hold. 

In the first setting, the goal is to protect the data records from the data 
miner. Hence, the data owner aims at anonymizing the data prior to its release. 
The main approach in this context is to apply data perturbation [1, 8]. The idea is
that the perturbed data can be used to infer general trends in the data, without 
revealing original record information. 

In the second setting, the goal is to perform data mining while protecting the
data records of each of the data owners from the other data owners. This is a
problem of secure multi-party computation. The usual approach here is
cryptographic rather than probabilistic. Lindell and Pinkas [14] showed how to
securely build an ID3 decision tree when the training set is distributed
horizontally. Lin et al. [13] discussed secure clustering using the EM algorithm
over horizontally distributed data. The problem of distributed association rule
mining was studied in [20, 22] in the vertical setting, where each party holds a
different set of attributes, and in [12] in the horizontal setting. The work of
[18] also considered this problem in the horizontal setting, but in large-scale
systems in which, on top of the parties that hold the data records (resources),
there are also managers, which are computers that assist the resources to decrypt
messages; another assumption made in [18] that distinguishes it from [12] and the
present study is that no collusions occur between the different network nodes,
whether resources or managers.

6 Conclusion 

We proposed a protocol for secure mining of association rules in horizontally
distributed databases that improves significantly upon the current leading
protocol [12] in terms of privacy and efficiency. One of the main ingredients in our
proposed protocol is a novel secure multi-party protocol for computing the union
(or intersection) of private subsets that each of the interacting players holds.
Another ingredient is a protocol that tests the inclusion of an element held by one
player in a subset held by another. The latter protocol exploits the fact that
the underlying problem is of interest only when the number of players is greater
than two.

One research problem that this study suggests was described in Section 3;
namely, to devise an efficient protocol for set inclusion verification that uses the
existence of a semi-honest third party. Such a protocol might enable further
improvement of the communication and computational costs of the second and
third stages of the protocol of [12], as described in Sections 3 and 4. Another
research problem that this study suggests is the extension of those techniques to
the problem of mining generalized association rules [19].